I am looking forward to taking part in this course. To be honest, I felt a bit intimidated at first, as I've only recently started using R. I have never done any coding before and have only used SPSS as my statistics software. However, I believe mastering R and learning the basics of data science is crucial if one wishes to pursue any sort of future in academia.
I hope that this introduction to data science will help me learn R and get a grasp of what data science is. I hope to be able to incorporate the lessons I learn here into my own PhD work.
My repository: https://github.com/OlanderRasmus/IODS-project
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Wed Nov 25 10:06:44 2020"
We will analyse a dataset consisting of answers to a survey on students' attitudes towards learning.
We begin by setting our working directory, reading in the data file and exploring the dimensions and structure of the dataset.
#We'll first set the working directory.
setwd("~/Desktop/IODS2020/IODS-project")
#And then check that it is correct.
getwd()
## [1] "/Users/rasmusolander/Desktop/IODS2020/IODS-project"
#We'll then read the file and examine it.
learning2014 <- read.table(file = "learning2014.txt")
dim(learning2014)
## [1] 166 7
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
As we can see, the dataset consists of 166 observations of 7 variables. The column "gender" gives the respondent's gender (M = male, F = female) and the column "Age" their age in years. The column "Attitude" describes the respondent's attitude towards statistics and the column "Points" their total points in an exam. The columns "deep", "stra" and "surf" give the mean scores on questions about deep, strategic and surface learning, respectively.
Let’s check the summary of the table, to know how the data is distributed.
summary(learning2014)
## gender Age Attitude Points
## Length:166 Min. :17.00 Min. :14.00 Min. : 7.00
## Class :character 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:19.00
## Mode :character Median :22.00 Median :32.00 Median :23.00
## Mean :25.51 Mean :31.43 Mean :22.72
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:27.75
## Max. :55.00 Max. :50.00 Max. :33.00
## deep surf stra
## Min. :1.583 Min. :1.583 Min. :1.250
## 1st Qu.:3.333 1st Qu.:2.417 1st Qu.:2.625
## Median :3.667 Median :2.833 Median :3.188
## Mean :3.680 Mean :2.787 Mean :3.121
## 3rd Qu.:4.083 3rd Qu.:3.167 3rd Qu.:3.625
## Max. :4.917 Max. :4.333 Max. :5.000
Next we will examine the data visually. We're interested in which variables might affect exam points, but we'll start by examining the whole dataset.
#We first load the libraries GGally and ggplot2.
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
#And then draw a scatter plot matrix (p1).
p1 <- ggpairs(learning2014, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
p1
Based on this, points seem to correlate most strongly (positively) with attitude (correlation coefficient 0.437), followed by strategic learning (0.146), and negatively with surface learning (-0.144).
We'll create a linear model including all three of these variables and examine it.
my_model <- lm(Points ~ Attitude + stra + surf, data = learning2014)
summary(my_model)
##
## Call:
## lm(formula = Points ~ Attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## Attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The model predicts linearly that Points = 11.01711 + 0.33952 * Attitude + 0.85313 * mean strategic score - 0.58607 * mean surface learning score.
The model has an R^2 of 0.2074 (i.e. it explains 20.74 % of the variance in Points) and is in itself significant (p < 0.05); however, neither the mean strategic score nor the mean surface score is a significant predictor. Let's remove these.
new_model <- lm(Points ~ Attitude, data = learning2014)
summary(new_model)
##
## Call:
## lm(formula = Points ~ Attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Now the model has an R^2 of 0.1906 and is still significant (p < 0.05). The R^2 value means that it is able to explain 19.06 % of the variance in the points.
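As a quick sketch of how such a fitted model is used for prediction, here is a toy example with simulated data mimicking the Points ~ Attitude relationship (not the actual course dataset):

```r
# Sketch with synthetic data: fit a simple linear model and predict
# Points for a new Attitude value, mirroring new_model above.
set.seed(1)
toy <- data.frame(Attitude = sample(14:50, 100, replace = TRUE))
toy$Points <- 11.6 + 0.35 * toy$Attitude + rnorm(100, sd = 5)
fit <- lm(Points ~ Attitude, data = toy)
# Predicted exam points for a hypothetical respondent with Attitude = 35
predict(fit, newdata = data.frame(Attitude = 35))
```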
Next we’ll check the diagnostics, i.e. that the assumptions of linear regression have not been broken. This is easiest done visually.
plot(new_model, which = c(1, 2, 5))
The first plot shows the residuals vs the fitted values, i.e. how far each data point is from the model's prediction. The residuals should be spread equally across the whole range of fitted values (homoscedasticity). Here, the residuals are distributed evenly throughout the predicted values.
The second plot shows how the residuals are distributed. The better they follow the Q-Q line, the more normal their distribution. The residuals are sufficiently normal.
The third plot shows the residuals vs leverage. The further to the right a point is, the more leverage it has, i.e. the more it affects the model. This is useful for detecting influential outliers. The model does not suffer from significant outliers.
date()
## [1] "Wed Nov 25 10:06:50 2020"
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
alc <- read.csv(file= "data/alc.csv", header = T)
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The dataset describes student achievement in secondary education at two Portuguese schools and includes, among other things, performance, absences and alcohol use. The dataset combines performance in two separate subjects, Portuguese and mathematics.
I will be looking into the effect of alcohol use on time spent studying ('studytime'), number of failures ('failures'), absences ('absences') and final grade ('G3'). The hypothesis is that higher alcohol use is negatively associated with studytime and final grade, and positively associated with the number of failures and absences.
It is good to first examine the data graphically. Alcohol use is a numeric variable ranging from 1 to 5, time spent studying an ordinal variable on a range from 1 to 4, the number of failures ordinal on a range from 0 to 4, absences numeric (number of absences) and the final grade numeric as well, ranging from 0 to 20.
p1 <- ggplot(alc, aes (x=alc_use, y = studytime)) + geom_bar(stat="identity")
p1
p2 <- ggplot(alc, aes (x=alc_use, y = failures)) + geom_bar(stat="identity")
p2
p3 <- ggplot(alc, aes (x=alc_use, y = absences, group = alc_use)) + geom_boxplot()
p3
p4 <- ggplot(alc, aes (x=alc_use, y = G3, group = alc_use )) + geom_boxplot()
p4
A trend in studytime is difficult to spot from the graph, and the same goes for failures. However, the number of absences seems to increase with increased alcohol consumption, while the final grade decreases.
Let’s look at the means per category of alcohol use.
alc %>%
group_by(alc_use) %>%
summarise(st = mean(studytime), fail = mean(failures), ab = mean(absences), grade = mean(G3))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 9 x 5
## alc_use st fail ab grade
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2.31 0.107 3.36 12.0
## 2 1.5 1.97 0.145 4.23 11.6
## 3 2 1.98 0.220 3.92 11.3
## 4 2.5 1.93 0.136 6.43 11.7
## 5 3 1.72 0.562 6.09 10
## 6 3.5 1.47 0.471 5.65 10.2
## 7 4 1.78 0.222 6 10.2
## 8 4.5 2 0 12 10.7
## 9 5 1.67 0.556 6.89 10.9
In addition to the earlier noted trends in absences and grades, failures seem to increase with increased alcohol consumption, while studytime decreases.
Next, let's explore the relationship between high alcohol use and studytime, failures, absences and grade, and compute odds ratios with 95 % CIs.
m1 <- glm(high_use ~ studytime + failures + absences + G3, data = alc, family = "binomial")
summary(m1)
##
## Call:
## glm(formula = high_use ~ studytime + failures + absences + G3,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1570 -0.8393 -0.6605 1.1103 2.1227
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.02933 0.56149 0.052 0.958346
## studytime -0.47423 0.15920 -2.979 0.002893 **
## failures 0.29443 0.20422 1.442 0.149383
## absences 0.07766 0.02271 3.420 0.000626 ***
## G3 -0.03519 0.03868 -0.910 0.362964
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 429.15 on 377 degrees of freedom
## AIC: 439.15
##
## Number of Fisher Scoring iterations: 4
coef(m1)
## (Intercept) studytime failures absences G3
## 0.02932628 -0.47423143 0.29442757 0.07765857 -0.03518531
OR1 <- coef(m1) %>% exp
CI1 <- confint(m1) %>% exp
## Waiting for profiling to be done...
cbind(OR1, CI1)
## OR1 2.5 % 97.5 %
## (Intercept) 1.0297605 0.3412502 3.1040114
## studytime 0.6223632 0.4513693 0.8437505
## failures 1.3423577 0.9000830 2.0158575
## absences 1.0807536 1.0359926 1.1326415
## G3 0.9654265 0.8948227 1.0417846
By looking at the summary of the model, we can see that failures and G3 are not significant predictors. This becomes even clearer when examining the ORs and their CIs. The ORs and their 95 % CIs are as follows: studytime 0.62 (0.45-0.84), failures 1.34 (0.90-2.02), absences 1.08 (1.04-1.13) and G3 0.97 (0.89-1.04).
Both failures and G3 include one within their 95 % CI and are not significant predictors, while studytime and absences are.
Thus our hypothesis that increased alcohol consumption is related to an increase in failures and a decrease in G3 is not necessarily true, while there does seem to be an association between increased alcohol use and both decreased time spent studying and increased absences.
m2 <- glm(high_use ~ studytime + absences, data = alc, family = "binomial")
summary(m2)
##
## Call:
## glm(formula = high_use ~ studytime + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2128 -0.8387 -0.7046 1.1996 2.1832
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.16640 0.33954 -0.490 0.624087
## studytime -0.55015 0.15550 -3.538 0.000403 ***
## absences 0.08054 0.02285 3.524 0.000425 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 433.65 on 379 degrees of freedom
## AIC: 439.65
##
## Number of Fisher Scoring iterations: 4
coef(m2)
## (Intercept) studytime absences
## -0.16639583 -0.55015181 0.08054463
OR2 <- coef(m2) %>% exp
CI2 <- confint(m2) %>% exp
## Waiting for profiling to be done...
cbind(OR2, CI2)
## OR2 2.5 % 97.5 %
## (Intercept) 0.8467110 0.4349166 1.6508083
## studytime 0.5768622 0.4211716 0.7758799
## absences 1.0838772 1.0386751 1.1361111
The model looks good, with both studytime and absences as significant predictors. Now, let's move on to prediction. Let's first examine the predictions using cross-tabulation and by plotting the values.
probabilities <- predict(m2, type = "response")
alc <- mutate(alc, probability= probabilities)
alc <- mutate(alc, prediction = probability >0.5)
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 256 12
## TRUE 96 18
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point()
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67015707 0.03141361 0.70157068
## TRUE 0.25130890 0.04712042 0.29842932
## Sum 0.92146597 0.07853403 1.00000000
The model correctly predicts 256 cases as false and 18 as true, while predicting 96 true cases as false and 12 false cases as true. This can be seen graphically as well.
Let’s further examine the model using the penalty/loss function.
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2827225
Based on the loss function we can see that the model predicts 28 % of the cases incorrectly. A simple guessing strategy would give a loss of about 0.50, indicating that our model is better than pure guessing. Still, 0.28 is quite a large error.
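To see where the guessing baselines come from, here is a small sketch applying the same loss function to trivial classifiers on a toy outcome vector (about 30 % TRUE, roughly like high_use; simulated, not the course data):

```r
# The penalty/loss function from above: mean share of wrong predictions.
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
set.seed(1)
class <- rbinom(1000, size = 1, prob = 0.3)  # toy outcome, ~30 % TRUE
loss_func(class, prob = 0)             # always guess FALSE: loss ~ 0.3
loss_func(class, prob = 1)             # always guess TRUE:  loss ~ 0.7
loss_func(class, prob = runif(1000))   # random coin flip:   loss ~ 0.5
```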
date()
## [1] "Wed Nov 25 10:06:52 2020"
library(dplyr)
library(ggplot2)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
Task 2
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The Boston dataset consists of 506 rows and 14 columns and describes housing values in suburbs of Boston, published by Harrison, D. and Rubinfeld, D.L. in 1978 and used by Belsley et al. in 1980. The data contains a set of variables per town, including per capita crime rate, zoning data, industry, demographics and housing.
Task 3
I chose to examine the dataset using a correlation plot, and to look at the summary of the data.
library(corrplot)
## corrplot 0.84 loaded
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
cor_matrix <- cor(Boston) %>% round (2)
cor_matrix
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
From the correlation plot and matrix we can see that, for instance, the index of accessibility to radial highways (rad) is highly correlated with the full-value property-tax rate per $10,000 (tax), with a coefficient of 0.91. The weighted mean of distances to five Boston employment centres (dis) is highly negatively correlated with the proportion of owner-occupied units built prior to 1940 (age), nitrogen oxides concentration in parts per 10 million (nox) and the proportion of non-retail business acres per town (indus), with coefficients of -0.75, -0.77 and -0.71, respectively. From the summary view we can see that the distributions vary greatly from variable to variable. However, the dummy variable chas (whether the tract bounds the Charles River) is oddly described by this kind of summary.
Task 4
Part 1, scaling the dataset and saving it as a data.frame.
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
class(boston_scaled)
## [1] "matrix" "array"
boston_scaled <- as.data.frame(boston_scaled)
From the summary, we can see that each variable is now centred on its own mean (mean zero).
Next, we'll create a categorical variable from the crime rate and drop the old crime rate variable from the dataset. I personally find this part somewhat confusing, which is why I kept the comments and code from DataCamp intact.
# summary of the scaled crime rate
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins , include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
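To make the quantile-binning step less confusing, here is a toy illustration of what `cut()` with quantile breaks does, independent of the Boston data:

```r
# Sketch: cutting a numeric vector into quartile-based categories,
# exactly as done for the scaled crime rate above.
x <- c(1, 3, 5, 7, 9, 11, 13, 15)
bins <- quantile(x)  # 0 %, 25 %, 50 %, 75 % and 100 % quantiles
# Each value is assigned to the quartile interval it falls into:
cut(x, breaks = bins, include.lowest = TRUE,
    labels = c("low", "med_low", "med_high", "high"))
```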
Finally, we’re dividing the dataset in training and testing sets, with 80% of the data in the train set.
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
Task 5
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2450495 0.2500000 0.2351485 0.2698020
##
## Group means:
## zn indus chas nox rm age
## low 1.09618215 -0.9579728 -0.07348562 -0.9125882 0.47953750 -0.9263576
## med_low -0.08606302 -0.2723132 -0.03844192 -0.5508797 -0.08009761 -0.3473981
## med_high -0.38794577 0.1550573 0.22498886 0.3884286 0.08169057 0.4082747
## high -0.48724019 1.0169738 -0.05560795 1.0644916 -0.41829387 0.8152711
## dis rad tax ptratio black lstat
## low 1.0004359 -0.6883741 -0.7285796 -0.4670275 0.37628555 -0.80148690
## med_low 0.2834409 -0.5611457 -0.4505840 -0.0732138 0.30789178 -0.15712974
## med_high -0.4182925 -0.4293981 -0.3377379 -0.2532004 0.07756901 0.02312643
## high -0.8540894 1.6395837 1.5150965 0.7824713 -0.77952971 0.94875277
## medv
## low 0.54919810
## med_low 0.01064321
## med_high 0.15678093
## high -0.74612226
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.13026928 0.639082660 -0.897140865
## indus 0.05396296 -0.201943608 0.308092160
## chas -0.09110441 -0.063498643 0.036321413
## nox 0.38464976 -0.605902021 -1.483675382
## rm -0.13910738 -0.012398559 -0.004036808
## age 0.19684403 -0.213225969 -0.329878475
## dis -0.07730313 0.094866660 -0.072903538
## rad 3.45441103 0.871348478 -0.177674951
## tax -0.08278407 0.131413971 0.742573715
## ptratio 0.12630478 0.007397766 -0.318687015
## black -0.10562998 0.028737677 0.153052294
## lstat 0.22566774 -0.162330931 0.341934368
## medv 0.16434807 -0.249436703 -0.325923537
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9529 0.0356 0.0115
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
Task 6
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 11 15 2 0
## med_low 3 18 4 0
## med_high 1 6 22 2
## high 0 0 0 18
Looking at the results, our model correctly predicted 11 as low, 18 as medium low, 22 as medium high and 18 as high, giving 69 correct predictions in total out of 102. However, the model also made 33 wrong predictions, in particular classifying low and medium high cases as medium low.
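The overall accuracy can be computed directly from the cross-tabulation; a small sketch with the counts copied from the table above:

```r
# Confusion matrix from the LDA cross-tabulation above
# (rows = correct class, columns = predicted class).
conf <- matrix(c(11,  3,  1,  0,
                 15, 18,  6,  0,
                  2,  4, 22,  0,
                  0,  0,  2, 18),
               nrow = 4,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
sum(diag(conf)) / sum(conf)  # proportion of correct predictions: 69/102, ~0.68
```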
Task 7
Part 1: calculating the distances.
data("Boston")
boston_scaled2 <- scale(Boston)
boston_scaled2 <- as.data.frame(boston_scaled2)
# euclidean distance matrix
dist_eu <- dist(boston_scaled2)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
Part 2: running the K-means algorithm, visualising it (not required), and investigating the optimal number of clusters. Note that the seed is set to 123, as K-means may produce different results on every run.
set.seed(123)
km <-kmeans(boston_scaled2, centers = 4)
pairs(boston_scaled2, col = km$cluster)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
The optimal number of clusters is where the total WCSS drops sharply. In this case it seems to be 2, as after that the slope flattens out.
Rerunning the clustering, with 2 clusters.
km <-kmeans(boston_scaled2, centers = 2)
pairs(boston_scaled2, col = km$cluster)
Examining the plots, the clustering seems to be doing all right for each variable. However, in order to look for closer trends I would like to know what groups we are typically looking for in this sort of data, and to compare the two predicted clusters against some sort of hypothesis.
library(dplyr)
library(ggplot2)
library(GGally)
library(corrplot)
human <- read.csv("data/human.csv", row.names = 1)
Task 1.
Let’s examine the data graphically.
ggpairs(human)
cor_matrix <- cor(human)
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
Examining the scatterplots, there seems to be a strong, linear association between expected education and life expectancy, and between adolescent birth rate and maternal mortality. The association between life expectancy and GNI seems almost exponential. The strongest correlation overall is the one between maternal mortality and life expectancy, which is rather expected.
Let’s look at the summary of the data:
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
As we can see from the graphical distributions and from the summaries, quite a lot of the data is rather skewed. Especially GNI, maternal mortality and adolescent birth rate have a rather large rightward tail, while life expectancy, education expectancy and the female-male ratio in the labour force all have leftward tails. This is confirmed by the summary view, as the mean and median of all the mentioned variables are quite far apart (especially so for GNI, maternal mortality and adolescent birth rate).
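The mean-median gap as a skewness indicator can be illustrated with a toy right-skewed sample (simulated log-normal data, not the actual human dataset):

```r
# Sketch: for a right-skewed variable the mean sits well above the
# median, as observed for GNI above.
set.seed(1)
x <- rlnorm(1000, meanlog = 9, sdlog = 1)  # toy right-skewed "income-like" data
mean(x) > median(x)  # TRUE: the long right tail pulls the mean upwards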
Task 2.
pca_human <- prcomp(human)
biplot(pca_human, choices = 1:2, cex = c(0.5, 0.5))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
Figure 1. Biplot of the principal component analysis on data from the UN human development program, unstandardised.
s <- summary(pca_human)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
pca_pr <- pca_pr <- round(100*s$importance[2,], digits = 1)
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
If we examine the biplot, we can see that gross national income per capita (GNI) completely dominates everything else. In fact, its arrow is parallel to the PC1 axis, showing that GNI and PC1 are essentially aligned. Examining the proportion of variance, we can see that PC1 accounts for 99.99 % of the variance and PC2 for 0.01 %, while the rest add practically nothing. This rounds to 100 % for PC1.
Task 3.
Let’s do the same as above, but now standardising the data.
human_std <- scale(human)
summary(human_std)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
pca_human_std <- prcomp(human_std)
biplot(pca_human_std, choices = 1:2, cex = c(0.5, 0.5))
Figure 2. Biplot of the principal component analysis on the same data as above, but standardised.
s_std <- summary(pca_human_std)
s_std
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
pca_pr_std <- round(100*s_std$importance[2,], digits = 1)
pca_pr_std
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
The results differ markedly from those above. This is caused by the standardisation of the data, which removed the dominating effect of GNI, allowing us to see the effects of the other variables as well. Looking at PC1 and PC2, they now explain 53.6 % and 16.2 % of the variance, respectively. We can now see, for instance, that GNI, expected education and life expectancy all "pull in the same direction" and are highly negatively correlated with maternal mortality and adolescent birth rate. This is to be expected: we might reasonably assume that people in countries with a higher gross national income per capita have a higher life expectancy and are expected to be more highly educated, while the opposite holds for countries with higher maternal mortality and adolescent birth rates. In fact, we can also see the logic behind higher adolescent birth rate and maternal mortality being related, as both can plausibly be associated with poorer access for women to healthcare services.
#Task 4.
My interpretation is that GNI is such a dominating factor that it hides everything else within it. Alone it would explain nearly all of the variance, as seen in task 2, where it was parallel to PC1, which explained more than 99.9 % of the variance. After standardising, however, we can see that it is not really the greatest source of variance, its arrow being shorter than any of the other arrows.
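The effect described above can be demonstrated on any dataset whose variables sit on very different scales. As a minimal sketch (using the built-in iris measurements as a stand-in, since the human data are not reproduced here), compare the proportions of variance from an unstandardised and a standardised PCA:

```r
# Stand-in demonstration: PCA with and without standardisation.
# The iris measurements play the role of the human data here.
num <- iris[, 1:4]

pca_raw <- prcomp(num)         # unstandardised: large-scale variables dominate
pca_std <- prcomp(scale(num))  # standardised: zero mean, unit standard deviation

# Proportion of variance (%) explained by each principal component
round(100 * summary(pca_raw)$importance[2, ], 1)
round(100 * summary(pca_std)$importance[2, ], 1)
```

With the raw measurements, PC1 is dominated by the variable with the largest variance, much as GNI dominated the unstandardised human data; after scaling, the variance spreads more evenly across the components.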
#Task 5.
library("FactoMineR")
data(tea)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
The tea dataset consists of 300 observations of 36 variables, based on a questionnaire on how individuals drink tea (18 questions), product perception (12 questions) and personal details (4 questions). In addition, age has been divided into classes. I was unable to find information on what the 36th variable would be, but I assume it is the frequency of drinking tea.
I’ll examine the columns containing the type of tea (“Tea”), whether the tea is drunk alone or with milk or lemon (“How”), whether it is made from tea bags, unpackaged tea, or both (“how”), the use of sugar (“sugar”), where the tea is bought (“where”), and whether it is drunk at lunch or not (“lunch”).
library(dplyr)
library(tidyr)
library(ggplot2)
library(FactoMineR)
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- dplyr::select(tea, one_of(keep_columns))
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
The data are visualised and summarised above. We can see that the data are split into three groups for “how”, “Tea” and “where”, into two groups for “lunch” and “sugar”, and into four groups for “How”. The distributions are shown in the bar plots above.
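As a side note, `gather()` has been superseded in newer versions of tidyr. A minimal sketch of the same faceted bar plots using `pivot_longer()`, converting the factors to character first (which also avoids the attribute warning above):

```r
# Sketch: the same faceted bar plots with tidyr::pivot_longer()
library(FactoMineR)  # provides the tea data
library(dplyr)
library(tidyr)
library(ggplot2)

data(tea)
tea_time <- select(tea, Tea, How, how, sugar, where, lunch)

tea_time %>%
  mutate(across(everything(), as.character)) %>%  # drop factor attributes
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%
  ggplot(aes(value)) +
  facet_wrap("key", scales = "free") +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
```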
Next, we’ll do an MCA on the data.
# multiple correspondence analysis and visualize
mca <- MCA(tea_time, graph = T)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = T)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
We’ll start by examining the summary of the MCA. Notably, from the eigenvalues table we can see that dimensions 1 and 2 together explain around 29 % of the variance, with dimension 1 explaining 15 % and dimension 2 explaining 14 %. Looking at the table of the categorical variables, we can see that “how” and “where” are the variables most strongly correlated with dimensions 1 and 2, being surpassed by “Tea” and “How” for dimension 3.
The variable representation plot shows the same information graphically, and we can indeed see that “how” and “where” are correlated with dimensions 1 and 2 in much the same way. Thus we can presume that whether tea is bought in a tea shop or a supermarket, and whether it is made from tea bags or unpackaged tea, is what tells our groups apart.
Finally, examining the MCA factor map for the variables, we can see that the people who buy their tea unpackaged are likely to buy it from a tea shop, and the tea is quite often green tea. Similarly, tea bags and chain stores seem to go together, and the people who buy both tea bags and unpackaged tea buy them from both tea shops and chain stores. Quite intuitive! We’ll leave the MCA factor map for the individuals as it is - it doesn’t really tell us much about the whole.
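To make the variable-level structure easier to read, one possible variant (a sketch, assuming FactoMineR is installed) is to redraw the factor map with the individuals hidden and each category coloured by the variable it belongs to:

```r
# Sketch: MCA factor map showing only the categories, coloured by variable.
library(FactoMineR)

data(tea)
tea_time <- tea[, c("Tea", "How", "how", "sugar", "where", "lunch")]
mca <- MCA(tea_time, graph = FALSE)

# invisible = "ind" hides the individuals; habillage = "quali"
# colours each category point by its parent variable.
plot(mca, invisible = "ind", habillage = "quali")
```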